This project focuses on analyzing the historical gas price trend and predict the future gas price. As we know, the prices at the pump started rising once the lockdowns were lifted and have spiraled faster since the start of the war. Our idea is to predict weekly gas prices using indicators like crude oil price, stocks, import/export etc.

The data source details are mentioned as follows. Historical gas price - https://www.eia.gov/opendata/qb.php?sdid=PET.EMM_EPM0U_PTE_NUS_DPG.W Crude oil spot price - https://www.eia.gov/opendata/qb.php?sdid=PET.RWTC.D Stocks of gasoline - https://www.eia.gov/opendata/qb.php?sdid=PET.WGFSTUS1.W Supply of gasoline - https://www.eia.gov/opendata/qb.php?sdid=PET.W_EPM0_VSD_NUS_DAYS.W Refinery capacity - https://www.eia.gov/opendata/qb.php?sdid=PET.WPULEUS3.W Crude oil exports - https://www.eia.gov/opendata/qb.php?sdid=PET.WCREXUS2.W Crude oil imports - https://www.eia.gov/opendata/qb.php?sdid=PET.WCRIMUS2.W

Loading the required libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(TSstudio)
## Warning: package 'TSstudio' was built under R version 4.1.3
library(corrplot)
## corrplot 0.92 loaded
library(skimr)
library(tidyr)
library(stringr)
library(caret)
## Loading required package: lattice
library(MLmetrics) 
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(forcats)
library(car)
## Warning: package 'car' was built under R version 4.1.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.1.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Load the gas price merged data

gas_price <- read.csv("./data/Gas_Price_Data_Merged.csv")

Check the structure of the gas price merged data

print(str(gas_price))
## 'data.frame':    1006 obs. of  9 variables:
##  $ Date                        : chr  "03/25/2022" "03/18/2022" "03/11/2022" "02/25/2022" ...
##  $ Crude_price_per_barrel      : num  113.7 100.4 113.4 92.2 92.9 ...
##  $ Stocks_per_thousand_barrels : int  18971 18344 17123 17714 17624 17746 18979 19797 19732 16986 ...
##  $ No_of_days                  : num  27.3 27 27.3 28 28.5 28.7 30.4 30.2 29 25.6 ...
##  $ Utilization_percentage      : num  92.1 91.1 90.4 87.7 87.4 85.3 86.7 87.7 88.1 89.8 ...
##  $ Exp_thousand_barrels_per_day: int  2988 3844 2936 3796 2686 2271 2376 2796 2610 2554 ...
##  $ Imp_thousand_barrels_per_day: int  6259 6486 6395 5767 6828 5790 7085 6236 6745 5884 ...
##  $ Gas_price_date              : chr  "03/28/2022" "03/21/2022" "03/14/2022" "02/28/2022" ...
##  $ Gas_price_per_gallon        : num  4.15 4.16 4.25 3.55 3.48 ...
## NULL

Convert Date and Gas_price_date as Date

gas_price$Date <- as.Date(gas_price$Date, format="%m/%d/%Y")

gas_price$Gas_price_date<-as.Date(gas_price$Gas_price_date, format="%m/%d/%Y")
print(str(gas_price))
## 'data.frame':    1006 obs. of  9 variables:
##  $ Date                        : Date, format: "2022-03-25" "2022-03-18" ...
##  $ Crude_price_per_barrel      : num  113.7 100.4 113.4 92.2 92.9 ...
##  $ Stocks_per_thousand_barrels : int  18971 18344 17123 17714 17624 17746 18979 19797 19732 16986 ...
##  $ No_of_days                  : num  27.3 27 27.3 28 28.5 28.7 30.4 30.2 29 25.6 ...
##  $ Utilization_percentage      : num  92.1 91.1 90.4 87.7 87.4 85.3 86.7 87.7 88.1 89.8 ...
##  $ Exp_thousand_barrels_per_day: int  2988 3844 2936 3796 2686 2271 2376 2796 2610 2554 ...
##  $ Imp_thousand_barrels_per_day: int  6259 6486 6395 5767 6828 5790 7085 6236 6745 5884 ...
##  $ Gas_price_date              : Date, format: "2022-03-28" "2022-03-21" ...
##  $ Gas_price_per_gallon        : num  4.15 4.16 4.25 3.55 3.48 ...
## NULL

Summary stats of the gas price merged data

summary(gas_price)
##       Date            Crude_price_per_barrel Stocks_per_thousand_barrels
##  Min.   :1994-11-25   Min.   :  3.32         Min.   : 16585             
##  1st Qu.:2001-09-22   1st Qu.: 27.87         1st Qu.: 27932             
##  Median :2008-08-04   Median : 50.77         Median :101114             
##  Mean   :2008-07-28   Mean   : 53.45         Mean   : 94755             
##  3rd Qu.:2015-05-27   3rd Qu.: 72.92         3rd Qu.:153529             
##  Max.   :2022-03-25   Max.   :139.95         Max.   :180559             
##    No_of_days    Utilization_percentage Exp_thousand_barrels_per_day
##  Min.   :19.80   Min.   : 56.00         Min.   :  10.0              
##  1st Qu.:23.10   1st Qu.: 87.10         1st Qu.:  30.0              
##  Median :24.20   Median : 90.35         Median :  95.0              
##  Mean   :24.62   Mean   : 89.69         Mean   : 561.2              
##  3rd Qu.:25.60   3rd Qu.: 93.30         3rd Qu.: 420.0              
##  Max.   :48.70   Max.   :100.50         Max.   :4462.0              
##  Imp_thousand_barrels_per_day Gas_price_date       Gas_price_per_gallon
##  Min.   : 4599                Min.   :1994-11-28   Min.   :0.926       
##  1st Qu.: 7514                1st Qu.:2001-09-25   1st Qu.:1.452       
##  Median : 8388                Median :2008-08-07   Median :2.292       
##  Mean   : 8404                Mean   :2008-07-31   Mean   :2.270       
##  3rd Qu.: 9414                3rd Qu.:2015-05-30   3rd Qu.:2.858       
##  Max.   :11324                Max.   :2022-03-28   Max.   :4.252

Each row of this dataset represents weekly historical information on variables that affect gas prices, from 1994 to 2022. These variables include the crude price per barrel, stocks per thousand barrels, gas price per gallon, and more. There are not any NA values in this dataset, which is also demonstrated below.

any(is.na(gas_price))
## [1] FALSE

Equivalently,

all(!is.na(gas_price))
## [1] TRUE

Basic statistics for gas price

Here we show summary statistics for gas prices throughout this dataset.

## Summary Statistics
cat(sprintf("Mean and Median for gas prices are: %.2f and %.2f \n", mean(gas_price$Gas_price_per_gallon), median(gas_price$Gas_price_per_gallon)))
## Mean and Median for gas prices are: 2.27 and 2.29
cat(sprintf("Standard Deviation for gas price is: %.2f \n", sd(gas_price$Gas_price_per_gallon)))
## Standard Deviation for gas price is: 0.85
cat(sprintf("Min, Max and Range for gas prices are: %.2f, %.2f and %.2f \n", min(gas_price$Gas_price_per_gallon), max(gas_price$Gas_price_per_gallon), max(gas_price$Gas_price_per_gallon)-min(gas_price$Gas_price_per_gallon)))
## Min, Max and Range for gas prices are: 0.93, 4.25 and 3.33
cat(sprintf("95th and 99th percentile for gas prices are: %.2f and %.2f \n", quantile(gas_price$Gas_price_per_gallon, 0.95), quantile(gas_price$Gas_price_per_gallon, 0.99)))
## 95th and 99th percentile for gas prices are: 3.70 and 3.96
cat(sprintf("IQR for gas price is: %.2f \n", IQR(gas_price$Gas_price_per_gallon)))
## IQR for gas price is: 1.41
cat(sprintf("Coefficient of variation for gas price is: %.2f \n", sd(gas_price$Gas_price_per_gallon)/mean(gas_price$Gas_price_per_gallon)))
## Coefficient of variation for gas price is: 0.37

Exploratory data analysis

gas_price %>% 
  arrange(desc(Gas_price_per_gallon)) %>% 
  head(n=10)
##          Date Crude_price_per_barrel Stocks_per_thousand_barrels No_of_days
## 1  2022-03-11                 113.39                       17123       27.3
## 2  2022-03-18                 100.43                       18344       27.0
## 3  2022-03-25                 113.69                       18971       27.3
## 4  2008-07-11                 139.95                      107192       22.9
## 5  2008-06-27                 137.00                      106707       22.6
## 6  2008-06-13                 134.80                      106241       22.5
## 7  2008-07-18                 135.37                      108249       23.2
## 8  2008-06-20                 134.34                      106292       22.5
## 9  2008-05-30                 128.47                      105616       22.5
## 10 2008-05-23                 130.14                      103839       22.1
##    Utilization_percentage Exp_thousand_barrels_per_day
## 1                    90.4                         2936
## 2                    91.1                         3844
## 3                    92.1                         2988
## 4                    89.5                           27
## 5                    89.2                           27
## 6                    89.3                           27
## 7                    87.1                           27
## 8                    88.6                           27
## 9                    89.7                           27
## 10                   87.9                           26
##    Imp_thousand_barrels_per_day Gas_price_date Gas_price_per_gallon
## 1                          6395     2022-03-14                4.252
## 2                          6486     2022-03-21                4.165
## 3                          6259     2022-03-28                4.152
## 4                         10791     2008-07-14                4.102
## 5                         10168     2008-06-30                4.075
## 6                         10259     2008-06-16                4.056
## 7                          9806     2008-07-21                4.054
## 8                         10251     2008-06-23                4.051
## 9                          9786     2008-06-02                3.980
## 10                         8959     2008-05-26                3.960

From the above table, it is evident that the gas prices were high during March 2022 and throughout 2008. Looking at some external factors that occurred during this period of time, it makes sense why these would consist of the highest gas prices. In 2008, the energy crisis was coming to an end, and the crude price per barrel hit its peak. Factors such as a surge in demand and international tension led gas prices to be higher. In 2022, certain factors such as decreased oil production due to COVID, sanctions on Russian oil, and post-pandemic demand increases, gas prices have been steadily rising throughout March and much of April. When evaluating gas prices, it’s important to analyze various factors such as these, in addition to factors such as the crude price per barrel, stocks, etc.

Time Series Plots

ts_plot(gas_price)
## Warning in ts_plot(gas_price): There are multipe 'date' objects in the data
## frame,using the first one object as the plot index

We also created an interactive time series plot which can be used to select between all of the variables in the data set to view the trends. Crude price per barrel, for example, had a similar trend as the gas prices did, which makes sense given that they are highly correlated. There was a sharp peak and decrease in 2008, followed by a short increases and decreases before increasing steadily in 2021. You can also see that the stocks variable had a decline throughout the time period in which this data was obtained.

ts_gas_price <- gas_price %>% 
  select(Date, Gas_price_per_gallon)
ts_plot(ts_gas_price)

Histograms, Density Plots, Box Plots and Violin Plots

In this next section, we created graphs to show the distribution of data for the various variables. We created histograms and density plots, as well as boxplots and violin plots.

ggplot(gas_price) + geom_histogram(aes(x=Gas_price_per_gallon), fill="#FF9999") +
  ggtitle("Distribution of gas prices", subtitle = "U.S. Weekly Retail Gas Prices (1994-2022)") +
  xlab("Selling price") + 
  ylab("Number of gallons")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(gas_price) + geom_density(aes(x=Gas_price_per_gallon))

For the price of gas per gallon, most of the data is below $3.00 per gallon, with a max of $4.25 per gallon.

ggplot(gas_price) + geom_histogram(aes(x=Crude_price_per_barrel), fill="#FF9999") +
  ggtitle("Distribution of crude oil prices", subtitle = "U.S. daily Crude oil Prices (1994-2022)") +
  xlab("Selling price") + 
  ylab("Number of barrels")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(gas_price) + geom_density(aes(x=Crude_price_per_barrel))

For crude price per barrel, most of these were in the lower range as well, with the majority of these at less than a $100 per barrel price point. The peak was at $20 per barrel, and the maximum selling price was approximately $145 per barrel.

ggplot(gas_price) + geom_histogram(aes(x=Stocks_per_thousand_barrels), fill="#FF9999") +
  ggtitle("Distribution of stocks of gasoline", subtitle = "U.S. weekly stocks of gasoline (1994-2022)") +
  xlab("Stocks value") + 
  ylab("Number of thousand barrels")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(gas_price) + geom_density(aes(x=Stocks_per_thousand_barrels))

The distribution of stocks per thousand barrels is bimodal, with extremes on both ends of the range and a dip in the middle. There is a wide range for this data, from approximately $20,000 to $80,000 per thousand barrels.

ggplot(data = gas_price) +
  geom_boxplot(aes(y=Gas_price_per_gallon, x=Stocks_per_thousand_barrels))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

ggplot(data = gas_price) +
  geom_boxplot(aes(y=Crude_price_per_barrel, x=Stocks_per_thousand_barrels))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

ggplot(data = gas_price) +
  geom_boxplot(aes(y=Crude_price_per_barrel, x=Gas_price_per_gallon))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

ggplot(data = gas_price) +
  geom_violin(aes(y=Crude_price_per_barrel, x=Gas_price_per_gallon))

skim(gas_price)
Data summary
Name gas_price
Number of rows 1006
Number of columns 9
_______________________
Column type frequency:
Date 2
numeric 7
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 1994-11-25 2022-03-25 2008-08-04 1006
Gas_price_date 0 1 1994-11-28 2022-03-28 2008-08-07 1006

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Crude_price_per_barrel 0 1 53.45 28.54 3.32 27.87 50.78 72.93 139.95 ▇▇▆▅▁
Stocks_per_thousand_barrels 0 1 94755.21 57304.59 16585.00 27931.50 101114.00 153529.00 180559.00 ▇▃▃▃▇
No_of_days 0 1 24.62 2.54 19.80 23.10 24.20 25.60 48.70 ▇▂▁▁▁
Utilization_percentage 0 1 89.69 5.13 56.00 87.10 90.35 93.30 100.50 ▁▁▁▇▆
Exp_thousand_barrels_per_day 0 1 561.20 1034.38 10.00 30.00 95.00 420.00 4462.00 ▇▁▁▁▁
Imp_thousand_barrels_per_day 0 1 8403.97 1318.47 4599.00 7514.25 8387.50 9414.50 11324.00 ▁▃▇▇▃
Gas_price_per_gallon 0 1 2.27 0.85 0.93 1.45 2.29 2.86 4.25 ▇▅▇▃▂

The above Skim function shows us the summary of our dataset in terms of the type of columns in our dataframe.

ggplot(gas_price, aes(x = Exp_thousand_barrels_per_day)) + 
  ggtitle("Distribution of exported barrels", subtitle = "U.S. Weekly Exports of Crude Oil (1994-2022)") +
  xlab("Amount exported") +
  ylab("Density") +
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "#FF9999") +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For the distribution of exported barrels, there is a right-skewed distribution as seen above, showing that there was typically a low amount of exported barrels.

ggplot(gas_price, aes(x = Imp_thousand_barrels_per_day)) + 
  ggtitle("Distribution of imported barrels", subtitle = "U.S. Weekly Imports of Crude Oil (1994-2022)") +
  xlab("Amount imported") +
  ylab("Density") +
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "#FF9999") +
  geom_density()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The amount imported has a slight left-skew, but this is less extreme than the skew on the amount imported.

ggplot(gas_price) + geom_density(aes(x=Utilization_percentage), fill="#FF9999") +
  ggtitle("Percent Utilization of Refinery Operable Capacity", subtitle = "U.S. Weekly Percent Utilization (1994-2022)") +
  xlab("Percent utilization") +
  ylab("Density")

The percent utilization has a left skew, showing that there was typically a high percent utilization of the refinery operable capacity.

ggplot(gas_price) + geom_density(aes(x=No_of_days), fill="#FF9999") +
  ggtitle("Days of supply of total gasoline", subtitle = "U.S. Weekly Days of Total Supply (1994-2022") +
  xlab("Number of days") +
  ylab("Density")

The days of supply variable has a right skew, and is centered around the 25 day mark. This measures the adequacy of inventories, and remains relatively consistent throughtout the data, primarily between the 20 and 30 day marks.

Correlation Plot

cols_for_corr <- c(2:7, 9)

gasprice_cor <- cor(gas_price[, cols_for_corr])
gasprice_cor
##                              Crude_price_per_barrel Stocks_per_thousand_barrels
## Crude_price_per_barrel                   1.00000000                  -0.5947264
## Stocks_per_thousand_barrels             -0.59472636                   1.0000000
## No_of_days                              -0.18501640                  -0.1964410
## Utilization_percentage                  -0.35799013                   0.3475956
## Exp_thousand_barrels_per_day             0.05832063                  -0.6024818
## Imp_thousand_barrels_per_day             0.06797031                   0.4708912
## Gas_price_per_gallon                     0.96773237                  -0.7217388
##                              No_of_days Utilization_percentage
## Crude_price_per_barrel       -0.1850164             -0.3579901
## Stocks_per_thousand_barrels  -0.1964410              0.3475956
## No_of_days                    1.0000000             -0.3143965
## Utilization_percentage       -0.3143965              1.0000000
## Exp_thousand_barrels_per_day  0.4088613             -0.2065698
## Imp_thousand_barrels_per_day -0.5678313              0.2243003
## Gas_price_per_gallon         -0.1388806             -0.3598479
##                              Exp_thousand_barrels_per_day
## Crude_price_per_barrel                         0.05832063
## Stocks_per_thousand_barrels                   -0.60248177
## No_of_days                                     0.40886134
## Utilization_percentage                        -0.20656981
## Exp_thousand_barrels_per_day                   1.00000000
## Imp_thousand_barrels_per_day                  -0.64120019
## Gas_price_per_gallon                           0.19010678
##                              Imp_thousand_barrels_per_day Gas_price_per_gallon
## Crude_price_per_barrel                         0.06797031           0.96773237
## Stocks_per_thousand_barrels                    0.47089115          -0.72173878
## No_of_days                                    -0.56783132          -0.13888060
## Utilization_percentage                         0.22430029          -0.35984791
## Exp_thousand_barrels_per_day                  -0.64120019           0.19010678
## Imp_thousand_barrels_per_day                   1.00000000          -0.03379215
## Gas_price_per_gallon                          -0.03379215           1.00000000
corrplot::corrplot(gasprice_cor)

The above correlation plot exhibits the relationship between the variables of our dataset. We can see that Crude Price and Stocks have a stronger correlation with Gas Price, the Crude Price has a strong Positive correlation whereas Stocks has a strong Negative correlation with Gas Price.

Train Test Split

# Simple partition into train (70%) and test (30%) set 
set.seed(447)
trainIndex <- createDataPartition(gas_price$Gas_price_per_gallon, p = .7, 
                                  list = FALSE, 
                                  times = 1)

gas_price_train <- gas_price[as.vector(trainIndex), ]  
gas_price_test <- gas_price[-as.vector(trainIndex), ]

EDA on training data

Scatter plots

ggplot(gas_price_train) + geom_point(aes(x=Crude_price_per_barrel, y=Gas_price_per_gallon))

ggplot(gas_price_train) + geom_point(aes(x=Stocks_per_thousand_barrels, y=Gas_price_per_gallon))

Model Building and evaluation

Null model

mean(gas_price_train$Gas_price_per_gallon)
## [1] 2.273307
gasprice_lm0 <- lm(Gas_price_per_gallon ~ 1, data = gas_price_train)
summary(gasprice_lm0)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ 1, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34731 -0.82206  0.01869  0.58519  1.97869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.27331    0.03224   70.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8565 on 705 degrees of freedom
# Compute overall mean Gas Price
null_pred <- mean(gas_price_train$Gas_price_per_gallon)
sprintf("Null model prediction: %.2f", null_pred)
## [1] "Null model prediction: 2.27"
# Compute null model MAE on train
null_train_mae <- MAE(gas_price_train$Gas_price_per_gallon, null_pred)
sprintf("Null model train MAE: %.2f", null_train_mae)
## [1] "Null model train MAE: 0.73"
# Compute null model MAE on test
null_test_mae <- MAE(gas_price_test$Gas_price_per_gallon, null_pred)
sprintf("Null model test MAE: %.2f", null_test_mae)
## [1] "Null model test MAE: 0.72"

Model-1: Linear Regression Model considering all the independent variables

gasprice_lm1 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel    + Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage +  Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm1)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel + 
##     Stocks_per_thousand_barrels + No_of_days + Utilization_percentage + 
##     Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, 
##     data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41574 -0.08708 -0.01783  0.05478  0.49003 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.041e+00  1.714e-01   6.076 2.02e-09 ***
## Crude_price_per_barrel        2.478e-02  2.928e-04  84.610  < 2e-16 ***
## Stocks_per_thousand_barrels  -3.387e-06  1.723e-07 -19.664  < 2e-16 ***
## No_of_days                   -7.583e-03  2.844e-03  -2.666  0.00785 ** 
## Utilization_percentage        3.616e-03  1.241e-03   2.913  0.00370 ** 
## Exp_thousand_barrels_per_day  2.216e-05  7.934e-06   2.794  0.00536 ** 
## Imp_thousand_barrels_per_day  8.294e-06  6.115e-06   1.356  0.17545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1398 on 699 degrees of freedom
## Multiple R-squared:  0.9736, Adjusted R-squared:  0.9734 
## F-statistic:  4293 on 6 and 699 DF,  p-value: < 2.2e-16
vif(gasprice_lm1)
##       Crude_price_per_barrel  Stocks_per_thousand_barrels 
##                     2.578147                     3.497961 
##                   No_of_days       Utilization_percentage 
##                     1.732759                     1.437350 
## Exp_thousand_barrels_per_day Imp_thousand_barrels_per_day 
##                     2.309302                     2.386561

as we can see in the above Variance Inflation Factor the values for the independent variables were less than 5 which indicates that there is no issue of multicollinearity with this model.

Compute MAE for the fitted model on training data

mae_train <- c(MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm0$fitted.values),
                MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm1$fitted.values)
)

mae_train
## [1] 0.7285952 0.1031485
mae_train_lm1 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm1$fitted.values)

mae_train_lm1
## [1] 0.1031485

Use fitted model to make predictions on test data

predict_lm1 <- predict(gasprice_lm1, newdata = gas_price_test)

Compute MAE for the predictions on the test data

mae_test <- c(MAE(gas_price_test$Gas_price_per_gallon, null_pred),
                MAE(gas_price_test$Gas_price_per_gallon, predict_lm1)
)

mae_test
## [1] 0.7208292 0.1081121

The gasprice_lm1 fits pretty well as it has r-square value of 0.9736 which is closer to 1 and almost all of them are significant variables except for import data variable. Whereas the MAE for gasprice_lm1 is lower than MAE for gasprice_lm0 since gasprice_lm1 contains additional variables.

More Model building

Model-2: Linear Regression Model considering stocks, no of days of supply, refinary capacity, export data and import data

gasprice_lm2 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage +  Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)

summary(gasprice_lm2)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels + 
##     No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + 
##     Imp_thousand_barrels_per_day, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25836 -0.27712 -0.03198  0.29967  1.82858 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.576e+00  5.307e-01  12.391  < 2e-16 ***
## Stocks_per_thousand_barrels  -1.349e-05  4.161e-07 -32.410  < 2e-16 ***
## No_of_days                   -5.415e-02  9.349e-03  -5.792 1.05e-08 ***
## Utilization_percentage       -3.077e-02  3.930e-03  -7.829 1.83e-14 ***
## Exp_thousand_barrels_per_day -1.622e-04  2.556e-05  -6.346 3.98e-10 ***
## Imp_thousand_barrels_per_day  1.382e-04  1.983e-05   6.971 7.30e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4684 on 700 degrees of freedom
## Multiple R-squared:  0.703,  Adjusted R-squared:  0.7009 
## F-statistic: 331.4 on 5 and 700 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm2 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm2$fitted.values)

mae_train_lm2
## [1] 0.3571621

Model-3: Linear Regression Model considering stocks, no of days of supply, refinary capacity, and export data

gasprice_lm3 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage +  Exp_thousand_barrels_per_day, data = gas_price_train)

summary(gasprice_lm3)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels + 
##     No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day, 
##     data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15973 -0.30274 -0.02469  0.30241  1.91508 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   8.615e+00  4.575e-01  18.832  < 2e-16 ***
## Stocks_per_thousand_barrels  -1.289e-05  4.208e-07 -30.630  < 2e-16 ***
## No_of_days                   -8.409e-02  8.581e-03  -9.799  < 2e-16 ***
## Utilization_percentage       -3.252e-02  4.053e-03  -8.024 4.32e-15 ***
## Exp_thousand_barrels_per_day -2.331e-04  2.423e-05  -9.618  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4841 on 701 degrees of freedom
## Multiple R-squared:  0.6824, Adjusted R-squared:  0.6806 
## F-statistic: 376.5 on 4 and 701 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm3 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm3$fitted.values)

mae_train_lm3
## [1] 0.3745696

Model-4: Linear Regression Model considering stocks, no of days of supply, and refinary capacity

gasprice_lm4 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage, data = gas_price_train)

summary(gasprice_lm4)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels + 
##     No_of_days + Utilization_percentage, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29993 -0.33327 -0.07542  0.29519  1.65483 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  9.438e+00  4.778e-01  19.752   <2e-16 ***
## Stocks_per_thousand_barrels -1.057e-05  3.668e-07 -28.824   <2e-16 ***
## No_of_days                  -1.159e-01  8.417e-03 -13.772   <2e-16 ***
## Utilization_percentage      -3.680e-02  4.283e-03  -8.591   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5147 on 702 degrees of freedom
## Multiple R-squared:  0.6405, Adjusted R-squared:  0.6389 
## F-statistic: 416.9 on 3 and 702 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm4 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm4$fitted.values)

mae_train_lm4
## [1] 0.4032406

Model-5: Linear Regression Model considering stocks and no of days of supply

gasprice_lm5 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)

summary(gasprice_lm5)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels + 
##     No_of_days, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10842 -0.38376 -0.07734  0.32972  1.79376 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.744e+00  2.190e-01   26.23   <2e-16 ***
## Stocks_per_thousand_barrels -1.165e-05  3.621e-07  -32.17   <2e-16 ***
## No_of_days                  -9.582e-02  8.494e-03  -11.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5407 on 703 degrees of freedom
## Multiple R-squared:  0.6027, Adjusted R-squared:  0.6015 
## F-statistic: 533.2 on 2 and 703 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm5 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm5$fitted.values)

mae_train_lm5
## [1] 0.429309

Model-6: Linear Regression Model considering only stocks as independent variable

gasprice_lm6 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels, data = gas_price_train)

summary(gasprice_lm6)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels, 
##     data = gas_price_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2823 -0.4074 -0.1600  0.4253  1.9211 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.314e+00  4.300e-02   77.07   <2e-16 ***
## Stocks_per_thousand_barrels -1.091e-05  3.868e-07  -28.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5871 on 704 degrees of freedom
## Multiple R-squared:  0.5307, Adjusted R-squared:  0.5301 
## F-statistic: 796.3 on 1 and 704 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm6 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm6$fitted.values)

mae_train_lm6
## [1] 0.4858075

Model-7: Linear Regression Model considering stocks, no of days of supply, refinary capacity, export data and crude price

gasprice_lm7 <- lm(Gas_price_per_gallon ~ Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage +  Exp_thousand_barrels_per_day + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm7)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Stocks_per_thousand_barrels + 
##     No_of_days + Utilization_percentage + Exp_thousand_barrels_per_day + 
##     Crude_price_per_barrel, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41145 -0.08643 -0.01746  0.05633  0.49323 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.134e+00  1.573e-01   7.205 1.51e-12 ***
## Stocks_per_thousand_barrels  -3.313e-06  1.634e-07 -20.270  < 2e-16 ***
## No_of_days                   -9.078e-03  2.623e-03  -3.460 0.000572 ***
## Utilization_percentage        3.656e-03  1.242e-03   2.944 0.003348 ** 
## Exp_thousand_barrels_per_day  1.892e-05  7.570e-06   2.500 0.012659 *  
## Crude_price_per_barrel        2.488e-02  2.836e-04  87.711  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1399 on 700 degrees of freedom
## Multiple R-squared:  0.9735, Adjusted R-squared:  0.9733 
## F-statistic:  5145 on 5 and 700 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm7 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm7$fitted.values)

mae_train_lm7
## [1] 0.1034463

Model-8: Linear Regression Model considering crude price, stocks, no of days of supply and refinary capacity

gasprice_lm8 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel    + Stocks_per_thousand_barrels + No_of_days  + Utilization_percentage, data = gas_price_train)
summary(gasprice_lm8)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel + 
##     Stocks_per_thousand_barrels + No_of_days + Utilization_percentage, 
##     data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41894 -0.08305 -0.01882  0.06074  0.48492 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.157e+00  1.576e-01   7.342 5.87e-13 ***
## Crude_price_per_barrel       2.461e-02  2.634e-04  93.431  < 2e-16 ***
## Stocks_per_thousand_barrels -3.578e-06  1.250e-07 -28.627  < 2e-16 ***
## No_of_days                  -7.677e-03  2.572e-03  -2.985  0.00294 ** 
## Utilization_percentage       3.561e-03  1.246e-03   2.859  0.00438 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1404 on 701 degrees of freedom
## Multiple R-squared:  0.9733, Adjusted R-squared:  0.9731 
## F-statistic:  6382 on 4 and 701 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm8 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm8$fitted.values)

mae_train_lm8
## [1] 0.1046638

Model-9: Linear Regression Model considering crude price, stocks and no of days of supply

gasprice_lm9 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel    + Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)
summary(gasprice_lm9)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel + 
##     Stocks_per_thousand_barrels + No_of_days, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44504 -0.08533 -0.01868  0.06310  0.51012 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.560e+00  7.135e-02  21.861  < 2e-16 ***
## Crude_price_per_barrel       2.435e-02  2.483e-04  98.053  < 2e-16 ***
## Stocks_per_thousand_barrels -3.560e-06  1.255e-07 -28.376  < 2e-16 ***
## No_of_days                  -1.054e-02  2.382e-03  -4.424 1.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1411 on 702 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.9728 
## F-statistic:  8421 on 3 and 702 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm9 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm9$fitted.values)

mae_train_lm9
## [1] 0.1053172

Model-10: Linear Regression Model considering crude price and stocks

gasprice_lm10 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel   + Stocks_per_thousand_barrels, data = gas_price_train)
summary(gasprice_lm10)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel + 
##     Stocks_per_thousand_barrels, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41985 -0.08728 -0.01990  0.05855  0.49917 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.259e+00  2.209e-02   57.01   <2e-16 ***
## Crude_price_per_barrel       2.475e-02  2.342e-04  105.68   <2e-16 ***
## Stocks_per_thousand_barrels -3.357e-06  1.183e-07  -28.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.143 on 703 degrees of freedom
## Multiple R-squared:  0.9722, Adjusted R-squared:  0.9721 
## F-statistic: 1.23e+04 on 2 and 703 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm10 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm10$fitted.values)

mae_train_lm10
## [1] 0.1052105

Model-11: Linear Regression Model considering only the crude price as independent variable

gasprice_lm11 <- lm(Gas_price_per_gallon ~ Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm11)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Crude_price_per_barrel, data = gas_price_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5648 -0.1388 -0.0510  0.1604  0.7260 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.722287   0.016696   43.26   <2e-16 ***
## Crude_price_per_barrel 0.028765   0.000273  105.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2093 on 704 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9403 
## F-statistic: 1.11e+04 on 1 and 704 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm11 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm11$fitted.values)

mae_train_lm11
## [1] 0.1704284

Model-12: Linear Regression Model considering no of days of supply, refinery capacity, export data, import data and crude price

gasprice_lm12 <- lm(Gas_price_per_gallon ~ No_of_days   + Utilization_percentage +  Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm12)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage + 
##     Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day + 
##     Crude_price_per_barrel, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47608 -0.10762 -0.03560  0.09335  0.66852 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   8.552e-01  2.131e-01   4.013 6.63e-05 ***
## No_of_days                   -9.652e-03  3.540e-03  -2.727  0.00655 ** 
## Utilization_percentage        3.388e-03  1.546e-03   2.191  0.02875 *  
## Exp_thousand_barrels_per_day  9.864e-05  8.612e-06  11.453  < 2e-16 ***
## Imp_thousand_barrels_per_day -2.986e-05  7.222e-06  -4.135 3.98e-05 ***
## Crude_price_per_barrel        2.877e-02  2.629e-04 109.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1741 on 700 degrees of freedom
## Multiple R-squared:  0.959,  Adjusted R-squared:  0.9587 
## F-statistic:  3272 on 5 and 700 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm12 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm12$fitted.values)

mae_train_lm12
## [1] 0.1335655

Model-13: Linear Regression Model considering no of days of supply, refinery capacity, export and import data

gasprice_lm13 <- lm(Gas_price_per_gallon ~ No_of_days   + Utilization_percentage +  Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm13)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage + 
##     Exp_thousand_barrels_per_day + Imp_thousand_barrels_per_day, 
##     data = gas_price_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7298 -0.5789 -0.1056  0.5001  1.9294 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.198e+01  7.961e-01  15.044  < 2e-16 ***
## No_of_days                   -1.274e-01  1.433e-02  -8.891  < 2e-16 ***
## Utilization_percentage       -7.492e-02  5.825e-03 -12.862  < 2e-16 ***
## Exp_thousand_barrels_per_day  1.961e-04  3.642e-05   5.386 9.85e-08 ***
## Imp_thousand_barrels_per_day  6.274e-06  3.067e-05   0.205    0.838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7402 on 701 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2531 
## F-statistic: 60.73 on 4 and 701 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm13 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm13$fitted.values)

mae_train_lm13
## [1] 0.6063415

Model-14: Linear Regression Model considering no of days of supply, refinery capacity, and export data

gasprice_lm14 <- lm(Gas_price_per_gallon ~ No_of_days   + Utilization_percentage +  Exp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm14)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage + 
##     Exp_thousand_barrels_per_day, data = gas_price_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7276 -0.5793 -0.1012  0.4996  1.9293 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.206e+01  6.776e-01  17.801  < 2e-16 ***
## No_of_days                   -1.287e-01  1.292e-02  -9.960  < 2e-16 ***
## Utilization_percentage       -7.491e-02  5.821e-03 -12.869  < 2e-16 ***
## Exp_thousand_barrels_per_day  1.920e-04  3.036e-05   6.326 4.49e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7397 on 702 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2541 
## F-statistic: 81.07 on 3 and 702 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm14 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm14$fitted.values)

mae_train_lm14
## [1] 0.6064749

Model-15: Linear Regression Model considering no of days of supply and refinery capacity

gasprice_lm15 <- lm(Gas_price_per_gallon ~ No_of_days   + Utilization_percentage, data = gas_price_train)
summary(gasprice_lm15)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days + Utilization_percentage, 
##     data = gas_price_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8058 -0.6348 -0.1168  0.5862  2.3328 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.813596   0.694978  16.999  < 2e-16 ***
## No_of_days             -0.099518   0.012401  -8.025 4.26e-15 ***
## Utilization_percentage -0.079018   0.005943 -13.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.76 on 703 degrees of freedom
## Multiple R-squared:  0.215,  Adjusted R-squared:  0.2127 
## F-statistic: 96.25 on 2 and 703 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm15 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm15$fitted.values)

mae_train_lm15
## [1] 0.6472869

Model-16: Linear Regression Model considering only no of days of supply as independent variable

gasprice_lm16 <- lm(Gas_price_per_gallon ~ No_of_days, data = gas_price_train)
summary(gasprice_lm16)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ No_of_days, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.28700 -0.88130  0.04211  0.56481  2.10313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.42011    0.32490  10.527  < 2e-16 ***
## No_of_days  -0.04657    0.01313  -3.547 0.000416 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8496 on 704 degrees of freedom
## Multiple R-squared:  0.01756,    Adjusted R-squared:  0.01616 
## F-statistic: 12.58 on 1 and 704 DF,  p-value: 0.0004157

Compute MAE for the fitted model on training data

mae_train_lm16 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm16$fitted.values)

mae_train_lm16
## [1] 0.725579

Model-17: Linear Regression Model considering refinery capacity, crude price, stocks, no of days of supply and import data

gasprice_lm17 <- lm(Gas_price_per_gallon ~ Utilization_percentage + Crude_price_per_barrel  + Stocks_per_thousand_barrels + No_of_days  + Imp_thousand_barrels_per_day, data = gas_price_train)
summary(gasprice_lm17)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage + 
##     Crude_price_per_barrel + Stocks_per_thousand_barrels + No_of_days + 
##     Imp_thousand_barrels_per_day, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42106 -0.08349 -0.01973  0.05977  0.48856 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.124e+00  1.696e-01   6.626 6.88e-11 ***
## Utilization_percentage        3.540e-03  1.247e-03   2.839  0.00466 ** 
## Crude_price_per_barrel        2.455e-02  2.829e-04  86.775  < 2e-16 ***
## Stocks_per_thousand_barrels  -3.623e-06  1.509e-07 -24.015  < 2e-16 ***
## No_of_days                   -7.019e-03  2.851e-03  -2.462  0.01405 *  
## Imp_thousand_barrels_per_day  3.146e-06  5.859e-06   0.537  0.59145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1405 on 700 degrees of freedom
## Multiple R-squared:  0.9733, Adjusted R-squared:  0.9731 
## F-statistic:  5101 on 5 and 700 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm17 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm17$fitted.values)

mae_train_lm17
## [1] 0.1046834

Model-18: Linear Regression Model considering refinery capacity and crude price

gasprice_lm18 <- lm(Gas_price_per_gallon ~ Utilization_percentage + Crude_price_per_barrel, data = gas_price_train)
summary(gasprice_lm18)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage + 
##     Crude_price_per_barrel, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56836 -0.14031 -0.05015  0.15903  0.73363 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.8741178  0.1574309   5.552    4e-08 ***
## Utilization_percentage -0.0016266  0.0016771  -0.970    0.332    
## Crude_price_per_barrel  0.0286561  0.0002954  97.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2093 on 703 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9403 
## F-statistic:  5550 on 2 and 703 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm18 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm18$fitted.values)

mae_train_lm18
## [1] 0.1700321

Model-19: Linear Regression Model considering only refinery capacity as independent variable

gasprice_lm19 <- lm(Gas_price_per_gallon ~ Utilization_percentage, data = gas_price_train)
summary(gasprice_lm19)
## 
## Call:
## lm(formula = Gas_price_per_gallon ~ Utilization_percentage, data = gas_price_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91231 -0.68783 -0.05725  0.61916  2.03036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.988488   0.528051   15.13   <2e-16 ***
## Utilization_percentage -0.063701   0.005876  -10.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7934 on 704 degrees of freedom
## Multiple R-squared:  0.143,  Adjusted R-squared:  0.1418 
## F-statistic: 117.5 on 1 and 704 DF,  p-value: < 2.2e-16

Compute MAE for the fitted model on training data

mae_train_lm19 <- MAE(gas_price_train$Gas_price_per_gallon, gasprice_lm19$fitted.values)

mae_train_lm19
## [1] 0.6769605
maes <- c(mae_train_lm1,
          mae_train_lm2,
          mae_train_lm3,
          mae_train_lm4,
          mae_train_lm5,
          mae_train_lm6,
          mae_train_lm7,
          mae_train_lm8,
          mae_train_lm9,
          mae_train_lm10,
          mae_train_lm11,
          mae_train_lm12,
          mae_train_lm13,
          mae_train_lm14,
          mae_train_lm15,
          mae_train_lm16,
          mae_train_lm17,
          mae_train_lm18,
          mae_train_lm19)

maes
##  [1] 0.1031485 0.3571621 0.3745696 0.4032406 0.4293090 0.4858075 0.1034463
##  [8] 0.1046638 0.1053172 0.1052105 0.1704284 0.1335655 0.6063415 0.6064749
## [15] 0.6472869 0.7255790 0.1046834 0.1700321 0.6769605
rank(maes)
##  [1]  1 10 11 12 13 14  2  3  6  5  9  7 15 16 17 19  4  8 18

So, lm1 has the lowest MAE followed by lm7 and then lm8.

Let’s see how these three models do in predicting the test data values.

pred_lm1 <- predict(gasprice_lm1, newdata = gas_price_test)
MAE(pred_lm1, gas_price_test$Gas_price_per_gallon)
## [1] 0.1081121
pred_lm7 <- predict(gasprice_lm7, newdata = gas_price_test)
MAE(pred_lm7, gas_price_test$Gas_price_per_gallon)
## [1] 0.1084135
pred_lm8 <- predict(gasprice_lm8, newdata = gas_price_test)
MAE(pred_lm8, gas_price_test$Gas_price_per_gallon)
## [1] 0.1096322

Model diagnostics

Scatterplots of actual vs fitted values

For our top 3 models, we have created a scatter plot showing actual vs fitted values of Gas Price.

Here are the scatters on the training data.

scatter_train_df <- data.frame(gasprice_actual = gas_price_train[,"Gas_price_per_gallon"],
                          lm1 = gasprice_lm1$fitted.values,
                          lm7 = gasprice_lm7$fitted.values,
                          lm8 = gasprice_lm8$fitted.values)
ggplot(scatter_train_df) + geom_point(aes(x=lm1, y=gasprice_actual)) + geom_abline(color='red')

ggplot(scatter_train_df) + geom_point(aes(x=lm7, y=gasprice_actual)) + geom_abline(color='red')

ggplot(scatter_train_df) + geom_point(aes(x=lm8, y=gasprice_actual)) + geom_abline(color='red')

The scatters confirm the relative rankings of our top 3 models

scatter_test_df <- data.frame(gasprice_actual = gas_price_test[,"Gas_price_per_gallon"],
                          lm1 = pred_lm1,
                          lm7 = pred_lm7,
                          lm8 = pred_lm8)
head(scatter_test_df)
##    gasprice_actual      lm1      lm7      lm8
## 5            3.480 3.498988 3.497462 3.472433
## 7            3.321 3.325307 3.319293 3.304099
## 8            3.271 3.304913 3.304877 3.281417
## 10           3.216 3.103842 3.113279 3.091159
## 11           3.211 3.019522 3.022481 2.992055
## 13           3.252 3.026161 3.029309 2.985269
ggplot(scatter_test_df) + geom_point(aes(x=lm1, y=gasprice_actual)) + geom_abline(color='red')

ggplot(scatter_test_df) + geom_point(aes(x=lm7, y=gasprice_actual)) + geom_abline(color='red')

ggplot(scatter_test_df) + geom_point(aes(x=lm8, y=gasprice_actual)) + geom_abline(color='red')

The scatters on test look better than the scatters on train.

Constant variance

ggplot(data.frame(residuals=gasprice_lm1$residuals,
                  fitted=gasprice_lm1$fitted.values)) + geom_point(aes(x=fitted, y=residuals))

The plot does not have constant variance as it is more of a fanned out shape which leads the selected top model towards being “Heteroskedastic”.

Make predictions for the test dataset

For each of our top 3 models, we made predictions for Gas Price using gas_price_test.

price_predict <- data.frame(predict1=predict(gasprice_lm1, newdata=gas_price_test),
                             predict7=predict(gasprice_lm7, newdata=gas_price_test),
                             predict8=predict(gasprice_lm8, newdata=gas_price_test),
                             price_actual=gas_price_test$Gas_price_per_gallon)

Evaluate the predictions

We then computed the MAE for each of the three models’ predictions on the test data.

price_predict_MAE <- data.frame(predict1_MAE=MAE(price_predict$predict1, gas_price_test$Gas_price_per_gallon),
                                predict7_MAE=MAE(price_predict$predict7, gas_price_test$Gas_price_per_gallon),
                                predict8_MAE=MAE(price_predict$predict8, gas_price_test$Gas_price_per_gallon))

price_predict_MAE
##   predict1_MAE predict7_MAE predict8_MAE
## 1    0.1081121    0.1084135    0.1096322

Model Comparison

The top model is “gasprice_lm1” for which we considered all the independent variables. The model makes sense in terms of the variables included as we can see from the correlation matrix the variable “Crude Price” had a strong positive correlation while “Stocks” had a strong negative correlation with Gas Price and same signs were depicted in the coeeficient estimate value for these variables. The reason behind selecting all variables is that almost all of them are significant variables except for import data variable and this model has the lowest MAE value for both training (0.1031485) and test set (0.1081121). Whereas the R-squared value is the highest 0.9736 being a good model fit.

str(gas_price_test$Date)
##  Date[1:300], format: "2022-02-18" "2022-01-28" "2022-01-21" "2021-12-31" "2021-12-24" ...
finalpredtestData<-data.frame(gas_price_test$Date,price_predict$predict7,gas_price_test$Gas_price_per_gallon )
colnames(finalpredtestData) <-c("Date","GasPricePredicted", "GasPriceActual")
ts_plot(finalpredtestData)

In the above time series plot showcases both the actual and predicted gas price values of our champion model we can observe that both the lines are following closely with each other across the years and proves that our model fits well. But for the year 2020 probably due to COVID as a factor which caused a drastic decrease in demand so in reality the gas prices went down and we can see the model’s predicted gas price was able to follow the trend but was just a little off by 0.70 cents.